| Chapter | Topic | Max points | Book Ch (K.V.) | Book Ch R for HDS | Assignment done? |
|---|---|---|---|---|---|
| 1 | Start me up! | 20 | 1 & 2 | - | ✗ |
| 2 | Regression and model validation | 5+15 | 3 & 4 | 7 | ✗ |
| 3 | Logistic regression | 5+15 | 5 & 6 | 7 | |
| 4 | Clustering and classification | ||||
| 5 | Dimensionality reduction techniques | ||||
| 6 | Analysis of longitudinal data | ||||
| Misc | Misc | | | | |
Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Nov 21 16:10:18 2022"
The text continues here.
Hello, testing this!! ZOM!
Hello World, why are my commits not pushing through?
I was feeling happy and motivated! I got R, Rstudio, Git and Github up and running. Now I see my commits on Github but unfortunately my diary is not updating. Apparently, according to Github, the updating of a public page may take up to 24h. I feel like that is a very long time and I wonder if commits made directly on Github (not remotely with Git) would take as long.
I discovered the IODS course through an email sent to all doctoral schools. I’m looking forward to learning about R Markdown, project management on GitHub, and clustering.
Looking forward to this course!
Update: I got to attend our course IT-help-zoom. It was great! Even though my diary did not update despite our efforts. Apparently everything looks good and my diary should update if I wait some more. The info was reassuring so I’ll be waiting more serenely now!
Meanwhile, I’ll keep myself busy with R for Health Data Science, as suggested during the course introduction lecture. It seems very informative and easy to read. I’m somewhat familiar with R but I haven’t used the tidyverse much; piping still confuses me a bit, but it’s my favorite topic! Exercise Set 1 is interesting and useful. It did require some package installations, but I’m good to go and ready to keep working now that all packages are…packed along!
Update v2: We did it! The diary updated itself after 2 h 13 min of me patiently waiting (and eagerly pushing new commits non-stop).
Link to my GitHub repo: https://github.com/hsanez/IODS-project
Yay! (Image: “A happy girl at a laptop” by Marina Vishtak)
date()
## [1] "Mon Nov 21 16:10:18 2022"
Here we go again…
Load the needed R packages before getting to work. Note: you might need to install the packages first if you haven’t used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr"))
library(tidyverse)
library(GGally)
I created a new data folder in my IODS-project folder. It includes an R script named ‘create_learning2014.R’, which is the R code for this data wrangling part of Assignment 2. The original data I’m using is the full learning2014 data downloaded from here. The data includes headers and uses tab as a separator. As per the course assignment instructions I have commented the data and my code in ‘create_learning2014.R’.
✓ Done!
In this part we created an analysis dataset which contains 7 variables: gender, age, attitude, deep, stra, surf and points. Some of these (attitude, deep, stra, surf, points) were made by combining questions in the learning2014 data, as defined in our Exercise Set and also on the bottom part of this page. The combination variables were scaled and observations where points = 0 were excluded. The data has now 166 observations and 7 variables. See my full comments and code in ‘create_learning2014.R’.
✓ Done!
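As a rough, self-contained sketch of those wrangling steps (the toy columns Q1–Q4 below are placeholders, not the real survey questions; see ‘create_learning2014.R’ for the actual code):

```r
library(dplyr)

# Toy data standing in for the full learning2014 survey
# (Q1-Q4 are placeholder questions, not the real survey items)
full <- data.frame(
  gender = c("F", "M", "F"),
  age    = c(25, 30, 22),
  Q1 = c(4, 2, 5), Q2 = c(3, 1, 4),   # questions behind e.g. 'deep'
  Q3 = c(5, 2, 3), Q4 = c(4, 1, 2),   # questions behind e.g. 'stra'
  points = c(22, 0, 30)
)

analysis <- full %>%
  mutate(
    deep = rowMeans(across(c(Q1, Q2))),   # combining by row mean keeps
    stra = rowMeans(across(c(Q3, Q4)))    # the answers on the 1-5 scale
  ) %>%
  filter(points > 0) %>%                  # exclude students with 0 exam points
  select(gender, age, deep, stra, points)

analysis
```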
I set the working directory of my R session to the IODS-project folder with setwd() and then saved the analysis dataset to the data folder as learning2014.csv with write_csv() (readr package, part of tidyverse). ?write_csv was a good help. With read_csv(), str(), dim() and head() I made sure the data was correctly written. Again, see my full comments and code in ‘create_learning2014.R’. (3 points)
✓ Done!
R Markdown Hint: When you knit the document, the working directory is temporarily set to be the folder where your R Markdown file is located. This is good to be aware of when reading data for example.
First we start by reading the analysis data to R. Make sure all required packages are installed (see beginning of Ch2). The data we’re using is the one created in the data wrangling part of assignment 2. The data can be also read directly from this page, where the separator is a comma and headers are included.
#load required packages
library(tidyverse)
library(GGally)
# Read data (Make sure you're in the correct working folder with getwd())
lrn14a <- read_csv("data/learning2014.csv")
# Alternative site for reading the data
#lrn14a <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt")
#Structure of the data
str(lrn14a)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Dimensions and first peek at the data
dim(lrn14a);head(lrn14a)
## [1] 166 7
## # A tibble: 6 × 7
## gender age attitude deep stra surf points
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 53 3.7 3.58 3.38 2.58 25
## 2 M 55 3.1 2.92 2.75 3.17 12
## 3 F 49 2.5 3.5 3.62 2.25 24
## 4 M 53 3.5 3.5 3.12 2.25 10
## 5 M 49 3.7 3.67 3.62 2.83 22
## 6 F 38 3.8 4.75 3.62 2.42 21
The analysis data is a subset of a full dataset called ‘learning2014’. The full data is the result of a survey on attitudes towards studying and statistics, conducted between 3.12.2014 and 1.10.2015 among students of an introductory course in social statistics.
The results are student’s answers to questions on different subtopics of learning, and some of these subtopics are summarized as new variables in our analysis dataset. This subset we created and read, ‘learning2014.csv’, is an analysis dataset which contains 7 variables: gender, age, attitude, deep, stra, surf and points. Points means the maximum exam points of the student.
Some of these variables were made by combining questions in the learning2014 data
See the survey questions and combinations made for the above mentioned variables on this page.
The summary variables were scaled to a Likert scale (scale from 1 to 5) and observations (students) with 0 exam points were excluded.
The remaining subset of data has 166 observations (students) and 7 variables.
We will look at a graphical overview of our data. We have 7 variables, of which gender seems to be a dichotomous variable without missing values, with more female than male students:
summary(lrn14a$gender); unique(lrn14a$gender); table(lrn14a$gender)
## Length Class Mode
## 166 character character
## [1] "F" "M"
##
## F M
## 110 56
We’ll use this information to split our graphical output in two (male and female students) and draw summary plots of the remaining variables:
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14a, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
In this plot matrix we see all variables split by gender: boxplots on top, histograms on the far left, then scatter plots and density plots, and finally Pearson correlation coefficients with statistical significance marked by asterisks (see the GGally documentation). For the Pearson correlations, both overall and gender-specific coefficients are shown. As we can see, attitude has the strongest correlation with points.
We’ll proceed to creating a regression model. As instructed we’ll choose three explanatory variables to our model where exam points is the dependent (outcome) variable. I’ll base my choice on the correlation information above, and I’ll try to avoid possible collinearity between the chosen variables. Since there are multiple explanatory (independent) variables, we’ll use a multiple regression model approach.
Dependent variable: points (exam points)
Independent variables: attitude, stra, surf
# create a regression model with multiple explanatory variables
reg_model <- lm(points ~ attitude + stra + surf, data = lrn14a)
# print out a summary of the model
summary(reg_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the summary of the model we can see that attitude is statistically significantly associated with points (p = 1.9e-08), whereas stra (p = 0.12) and surf (p = 0.47) are not.
The t values are t statistics that have been calculated as estimate/standard error.
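A quick check of that relationship, using R’s built-in mtcars data rather than the course data:

```r
# t value = Estimate / Std. Error, verified on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients
all.equal(unname(coefs[, "t value"]),
          unname(coefs[, "Estimate"] / coefs[, "Std. Error"]))
## [1] TRUE
```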
The F-statistic has a very low p-value, meaning there is evidence that at least one of the coefficients of the studied explanatory variables is non-zero.
R-squared tells us about the fit of the model to our data. Since we have used multiple explanatory variables, we look at the adjusted R-squared, which penalizes the model for the number of explanatory variables used. Here the adjusted R-squared is 0.19, meaning the three chosen explanatory variables account for approximately 20% of the variation in exam points.
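The adjustment can be written out explicitly: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of explanatory variables. A quick check on built-in data:

```r
# Verify the adjusted R-squared formula on a built-in dataset
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s   <- summary(fit)
n   <- nrow(mtcars)  # number of observations
p   <- 3             # number of explanatory variables
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj, s$adj.r.squared)
## [1] TRUE
```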
Since only attitude seems to be statistically significantly associated with the dependent variable, points, we’ll run the regression model once more, this time with attitude as the only explanatory variable, to get a better-fitting and more parsimonious model. (Other explanatory variables, such as age and gender, could also be tested in place of surf and stra.) We’ll also draw a scatter plot with a regression line to visualize the result.
# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = lrn14a) + geom_smooth(method = "lm")
# create a regression model with a single explanatory variable
reg_model2 <- lm(points ~ attitude, data = lrn14a)
# print out a summary of the model
summary(reg_model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The regression line fits the scatter plot quite nicely, with a positive slope and tight 95% confidence intervals. Again we see that attitude is statistically significantly associated with points (p = 4.1e-09): exam points rise by approximately 3.5 when attitude (based on the original survey) rises by 1. The estimate is slightly higher and the p-value slightly lower than before. Since we now have only one explanatory variable, we can assess the multiple R-squared of the model (0.19), which equals the square of the correlation coefficient between attitude and points. This means that almost 20% of the variation in points can be explained by attitude.
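That identity (for simple linear regression, R² equals the squared Pearson correlation) is easy to verify, here with built-in data:

```r
# For a single-predictor model, R-squared equals the squared correlation
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(summary(fit)$r.squared, cor(mtcars$wt, mtcars$mpg)^2)
## [1] TRUE
```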
We have fitted a regression model to our data and interpreted the results. Now we will assess the model to check whether it complies to our statistic assumptions, and that it is a sensible model to use on our data.
Graphical model validation with plot():
# par() could be used to show plots in a matrix
# par(mfrow = c(2,2))
# draw diagnostic plots using the plot() function.
plot(reg_model2, which=c(1,2,5))
Linear regression modeling has four main assumptions (paraphrased from R for Health Data Science):

- Linear relationship between the predictors and the outcome
- Independence of the residuals
- Normal distribution of the residuals
- Equal variance of the residuals across the fitted values (homoscedasticity)
List of diagnostic plots for plot() and which():
| which | graphic |
|---|---|
| 1 | Residuals vs Fitted values |
| 2 | Normal QQ-plot |
| 3 | Scale-Location (square root of standardized residuals vs Fitted values) |
| 4 | Cook’s distances |
| 5 | Residuals vs Leverage |
| 6 | Cook’s distance vs Leverage |
✓ ✓ ✓ ✓ ✓ Done!
date()
## [1] "Mon Nov 21 16:10:32 2022"
Here we go yet again…
Note: you might need to install some packages first if you haven’t used them before (see install.packages()).
#install.packages(c("tidyverse","GGally","readr"))
Load the needed R packages before getting to work:
#load required packages
library(tidyverse)
library(GGally)
library(dplyr)
library(ggplot2)
See create_alc.R on my Github repository.
Read the analysis data.
# Read data (Make sure you're in the correct working folder with getwd())
alc <- read_csv("data/student_alc.csv")
# Alternative site for reading the data
#alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv")
Describe the data.
# print variable names and dimension of the tibble
colnames(alc); dim(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
## [1] 370 35
The analysis dataset we are using for this course assignment consists of the above listed 35 variables and 370 observations.
Our analysis dataset is a subset of an original data ‘Student Performance Data Set’ by Prof. Cortez (University of Minho, Portugal) collected from two Portuguese schools during the school year 2005-2006. The original dataset was collected using questionnaires and school reports.
The analysis dataset was created by joining two datasets from the original data: student records for two school subjects, Mathematics and Portuguese language. These subject-specific datasets were merged by their common variables, and only students present in both subject datasets were included in the resulting analysis dataset. Naturally there was variation between the two subjects in the course-specific variables (such as absences and grades), which we handled as follows:
For the numeric variables we calculated student-specific averages using both of the datasets (Mathematics and Portuguese language). For the binary variable (paid) we chose the value from the Mathematics dataset.
See the R code for the exact course of our data wrangling.
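As a self-contained illustration of that merge-and-average logic (toy data with made-up values; ‘create_alc.R’ has the real wrangling):

```r
library(dplyr)

# Toy subject-specific datasets (values made up for illustration)
math <- data.frame(id = c(1, 2), absences = c(4, 6), paid = c("yes", "no"))
por  <- data.frame(id = c(1, 2), absences = c(2, 8), paid = c("no",  "no"))

# Keep only students present in both datasets
joined <- inner_join(math, por, by = "id", suffix = c(".math", ".por"))

alc_sketch <- joined %>%
  mutate(
    absences = (absences.math + absences.por) / 2,  # average the numeric variable
    paid     = paid.math                            # keep the Maths answer for the binary one
  ) %>%
  select(id, absences, paid)

alc_sketch
```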
You can find the variable descriptions and more information about (and download) the original dataset on this page.
My chosen variables: age, sex, famrel (quality of family relationships) and absences.
The original dataset includes students aged 15 to 22. This means not all students are of legal drinking age, which seems to be 18 in Portugal. Younger students cannot buy alcohol themselves, which could show in the relationship between age and alcohol use: there should be less alcohol use among younger students.
On average, men are larger than women and their body composition differs, which makes their physiology, and thus their alcohol metabolism, different from women’s. As a result, male tolerance for alcohol is usually higher, which may lead to men drinking more before feeling the effects of alcohol. For this reason I think there might be a relationship between sex and alcohol use.
Students might spend a lot of time at home, both studying and during their free time. Thus, bad family relations could affect study opportunities, motivation and recovery from studying. One reason for bad family relationships may be alcohol itself, perhaps overused by the student themselves or by other family members.
Students drinking more alcohol might have more absences from school. Since weekday alcohol consumption is measured it could be seen as absences during the school week. Friday or Saturday use might not come up as absences in the data, assuming that these schools in Portugal have a regular Monday to Friday school week schedule.
My hypotheses:

1. Younger students use less alcohol than older students.
2. Male students use more alcohol than female students.
3. Worse family relationships are associated with higher alcohol use.
4. Higher alcohol use is associated with more absences from school.
# Load needed libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(GGally)
# glimpse at the data
glimpse(alc); dim(alc)
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## [1] 370 35
# create a vector of the chosen variables + alcohol use
myvars <- c('age', 'sex', 'famrel', 'absences', 'alc_use', 'high_use')
# draw a bar plot of the chosen variables
gather(alc[myvars]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
# draw summary table grouped by sex and low/high use of alcohol consumption
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = round(mean(age),1), mean_family_relation_quality = round(mean(famrel),1), mean_absences = round(mean(absences),1))
## # A tibble: 4 × 6
## # Groups: sex [2]
## sex high_use count mean_age mean_family_relation_quality mean_absences
## <chr> <lgl> <int> <dbl> <dbl> <dbl>
## 1 F FALSE 154 16.6 3.9 4.3
## 2 F TRUE 41 16.5 3.7 6.9
## 3 M FALSE 105 16.3 4.1 2.9
## 4 M TRUE 70 16.9 3.8 6.1
# draw summary plot matrix stratified for sex as an overview of the data and correlations
p <- ggpairs(alc[myvars], mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
library(dplyr); library(ggplot2)
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = round(mean(G3),1))
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 11.4
## 2 F TRUE 41 11.8
## 3 M FALSE 105 12.3
## 4 M TRUE 70 10.3
A NEW MODEL?
Our outcome variable is dichotomous, or binary, so we will assess its relationship with our chosen variables with a logistic regression model.
Dependent variable: high_use (high alcohol use, TRUE/FALSE)
Independent variables: age, sex, famrel, absences
library(dplyr); library(ggplot2)
# First, we define our logistic regression model m
# We'll use glm(), a generalized linear model function so we need to specify our model type to be logistic -> family="binomial", which makes the function use the link option 'logit'.
m <- glm(high_use ~ age + sex + famrel + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + famrel + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2151 -0.8371 -0.5950 1.0577 2.1545
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.44606 1.75957 -1.958 0.050175 .
## age 0.17148 0.10306 1.664 0.096143 .
## sexM 1.09171 0.24864 4.391 1.13e-05 ***
## famrel -0.32224 0.12913 -2.495 0.012579 *
## absences 0.08915 0.02298 3.879 0.000105 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 407.03 on 365 degrees of freedom
## AIC: 417.03
##
## Number of Fisher Scoring iterations: 4
# Compute odds ratios (OR) and their confidence intervals (CI) for each variable
OR <- coef(m) %>% exp %>% round(2)
CI <- confint(m) %>% exp %>% round(2)
# print ORs and CIs as a table
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03 0.00 0.98
## age 1.19 0.97 1.46
## sexM 2.98 1.84 4.89
## famrel 0.72 0.56 0.93
## absences 1.09 1.05 1.15
From the fitted model and the odds ratios we can see that male sex (OR 2.98, 95% CI 1.84–4.89) and absences (OR 1.09, 95% CI 1.05–1.15) are associated with higher odds of high alcohol use, while better family relationships are protective (OR 0.72, 95% CI 0.56–0.93). Age shows no statistically significant association (OR 1.19, 95% CI 0.97–1.46, crossing 1). These results support my hypotheses concerning sex, family relationships and absences, but not the one concerning age.
To assess the predictive power of our model we will create a ‘confusion matrix’ and a bar plot of observed versus predicted outcomes, and compute the training error of the model. Finally, we’ll compare the performance of the model with that of a simple guessing strategy.
As instructed, we’ll focus on the variables that showed statistical significance: sex, famrel and absences. Age had no statistically significant association with alcohol use level based on our regression model.
# predict() the probability of high_use
# We're using our model 'm' as the object of the predict()
# type = 'response' means we want predictions on the scale of the response variable instead of the default link scale. For a binomial target variable (e.g. our 'high_use') this gives predicted probabilities instead of logit-scale values (the default for binomial models).
probabilities <- predict(m, type = "response")
library(dplyr)
# add the predicted probabilities to our dataframe 'alc' in a column named 'probability'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
# Let's set our prediction treshold: If the probability is more than 50%, prediction = TRUE, otherwise it's FALSE.
alc <- mutate(alc, prediction = probabilities > 0.5)
# check the last ten original classes, predicted probabilities, and class predictions
select(alc, sex, famrel, absences, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 6
## sex famrel absences high_use probability prediction
## <chr> <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 M 4 3 FALSE 0.387 FALSE
## 2 M 4 0 FALSE 0.405 FALSE
## 3 M 5 7 TRUE 0.437 FALSE
## 4 F 5 1 FALSE 0.132 FALSE
## 5 F 4 6 FALSE 0.247 FALSE
## 6 F 5 2 FALSE 0.165 FALSE
## 7 F 4 2 FALSE 0.187 FALSE
## 8 F 1 3 FALSE 0.398 FALSE
## 9 M 2 4 TRUE 0.568 TRUE
## 10 M 4 2 TRUE 0.406 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 247 12
## TRUE 76 35
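From this table we can already compute the model’s training error: of the 370 students, 247 + 35 = 282 are classified correctly and 12 + 76 = 88 are misclassified.

```r
# Counts read from the confusion matrix above
correct <- 247 + 35        # correctly classified (true negatives + true positives)
wrong   <- 12 + 76         # misclassified (false positives + false negatives)
wrong / (correct + wrong)  # training error
## [1] 0.2378378
```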
We can visualize our observations and predictions with a bar plot:
# initialize a plot of 'prediction' (predicted outcomes)
p2 <- ggplot(data = alc, aes(x=prediction))
# draw a bar plot of high_use by 'high use' (observations)
p2 + geom_bar() + facet_wrap("high_use")
As we can see, there are more FALSE observations (true negative observation = no high use of alcohol, left panel) than TRUE observations (true positive observation = high use of alcohol, right panel). The prediction of true negatives seems to be better than the prediction of true positives.
We can also visualize the model’s performance (sensitivity and specificity) with a ROC (receiver operating characteristic) curve, which helps us assess how well our logistic regression model fits the data. The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR). The closer the curve comes to the upper left corner of the plot, the more accurate the test; the closer it stays to the grey 45° line, the less accurate the test.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Define the curve with function roc(response, predictor), see ?roc() and pROC documentation (Resources)
roc_obj <- roc(alc$high_use, alc$probability)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
# Draw the plot
plot(roc_obj)
We can see that the curve departs from the 45° line, which means that our model seems to have at least some predictive value.
To calculate the training error (the total proportion of inaccurately classified individuals) of our model, we’ll run a loss function.
# See https://daviddalpiaz.github.io/r4sl/the-caret-package.html
library(caret)
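A minimal sketch of such a loss function, computing the proportion of incorrect classifications (it assumes the ‘alc’ data with the ‘probability’ column created above):

```r
# Proportion of individuals whose predicted class (prob > 0.5)
# differs from the observed class
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# With our data this would give the training error:
# loss_func(class = alc$high_use, prob = alc$probability)

# Self-contained check: one wrong prediction out of four -> 0.25
loss_func(class = c(0, 0, 1, 1), prob = c(0.2, 0.7, 0.9, 0.8))
## [1] 0.25
```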
✓ ✓ ✓
(more chapters to be added similarly as we proceed with the course!)